The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable?
To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice Visual Question Answering tasks: comprehension and soft localization.
GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
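For reference, the sketch below shows one way the few-shot (ICL) multiple-choice evaluation described above can be run against a VLM API. The client usage follows the OpenAI Python SDK, but the model name, helper functions, and message layout are illustrative assumptions, not the exact evaluation code behind the leaderboard.

```python
# Minimal sketch of 4-shot multiple-choice prompting of a VLM.
# Assumptions: OPENAI_API_KEY is set, and images are local JPEG files.
import base64
from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data-URL message part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def ask(few_shot: list[tuple[str, str, str]], image: str, question: str) -> str:
    """few_shot: (image_path, question_with_options, gold_answer) triples
    placed in the context window before the query instance."""
    content = []
    for img, q, a in few_shot:
        content += [image_part(img), {"type": "text", "text": f"{q}\nAnswer: {a}"}]
    content += [image_part(image), {"type": "text", "text": f"{question}\nAnswer:"}]
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumption: any vision-capable endpoint works
        messages=[{"role": "user", "content": content}],
        max_tokens=10,
    )
    return resp.choices[0].message.content.strip()
```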
IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances across 12 optical illusion categories. Each instance consists of an image containing an optical illusion, a question, and 3 to 6 options, one of which is the correct answer. We refer to this task as IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. This task consists of 1000 instances in a similar format to the first. We refer to it as IllusionVQA-Soft-Localization.
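As a rough illustration of the instance format described above, an example can be represented as follows; the field names are our assumptions for illustration, not the dataset's published schema. A model is scored correct on an instance only if it selects the gold option among the candidates.

```python
# Hypothetical instance schema mirroring the task description above.
from dataclasses import dataclass

@dataclass
class IllusionVQAInstance:
    image_path: str        # image containing the optical illusion
    question: str          # question about the image
    options: list[str]     # 3 to 6 candidate answers
    answer: str            # the single correct option
```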
One example for each type of illusion in IllusionVQA-Comprehension
Sample questions from IllusionVQA-Comprehension
Sample questions from IllusionVQA-Soft-Localization
| # | Model | Source | Date | ALL | Impossible Object | Real-Scene | Size | Hidden | Deceptive Design | Angle Illusion | Color | Edited-Scene | Upside-Down | Pos.-Neg. Space | Circle-Spiral | Miscellaneous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | Human Performance* | Link | 2024-02-25 | 91.03 | 98.51 | 98.44 | 63.04 | 100 | 94.59 | 84.62 | 60.87 | 100 | 100 | 100 | 66.67 | 89.47 |
| 1 | GPT4V (4-shot) | Link | 2024-02-25 | 62.99 | 58.96 | 54.69 | 69.57 | 46.67 | 72.97 | 84.62 | 82.61 | 80.95 | 71.43 | 85.71 | 33.33 | 42.11 |
| 2 | GPT4V (0-shot) | Link | 2024-02-25 | 58.85 | 55.22 | 57.81 | 58.70 | 51.11 | 70.27 | 69.23 | 69.57 | 71.43 | 71.43 | 57.14 | 50 | 42.11 |
| 3 | Gemini (4-shot) | Link | 2024-02-25 | 52.87 | 56.72 | 46.88 | 52.17 | 48.89 | 67.56 | 50 | 17.39 | 66.67 | 57.14 | 71.43 | 33.33 | 57.89 |
| 4 | Gemini (0-shot) | Link | 2024-02-25 | 51.26 | 56.72 | 46.88 | 45.65 | 42.22 | 64.86 | 53.85 | 17.39 | 66.67 | 57.14 | 85.71 | 33.33 | 52.63 |
| 5 | LLaVa (0-shot) | Link | 2024-02-25 | 40 | 43.28 | 42.19 | 19.57 | 42.22 | 43.24 | 38.46 | 26.09 | 61.90 | 71.43 | 42.86 | 0.00 | 42.11 |
| 6 | Cog (0-shot) | Link | 2024-02-25 | 38.16 | 44.03 | 34.38 | 13.04 | 42.22 | 45.95 | 30.77 | 30.43 | 42.86 | 71.43 | 71.43 | 16.67 | 42.11 |
| 7 | I-BLIP (0-shot) | Link | 2024-02-25 | 34.25 | 34.22 | 26.56 | 26.09 | 44.44 | 37.84 | 30.77 | 30.43 | 42.86 | 42.86 | 57.41 | 33.33 | 36.84 |
The remaining columns report accuracy (%) on each illusion category in the dataset.
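As a side note, per-category scores like those above can be computed from per-instance predictions along these lines; the record fields (`category`, `prediction`, `answer`) are hypothetical names for illustration.

```python
# Sketch: group per-instance results by illusion category and report
# accuracy (%) per group, matching the layout of the leaderboard columns.
from collections import defaultdict

def per_category_accuracy(records: list[dict]) -> dict[str, float]:
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["prediction"] == r["answer"])
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}
```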
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
🚨 For more submission details, please refer to this link.
Coming soon!
@misc{shahgir2024illusionvqa,
  title={IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models},
  author={Haz Sameen Shahgir and Khondker Salman Sayeed and Abhik Bhattacharjee and Wasi Uddin Ahmad and Yue Dong and Rifat Shahriyar},
  year={2024},
  eprint={2403.15952},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}